
[test] fix: shrink CP packed SFT functional model#4383

Merged
yaoyu-33 merged 1 commit into main from yuya/fix-seqpacking-cp-oom on Jun 16, 2026

@yaoyu-33

Contributor

Summary

The context-parallel (CP) packed SFT functional test now creates a pretrain checkpoint and then loads it for SFT in the same pytest process. With the full Llama 3.2 1B recipe shape, the second DDP buffer allocation runs close to the H100 memory limit, so the test can OOM before the SFT path starts.

This keeps the test coverage but shrinks only the test model shape for both the pretrain and SFT providers, so the checkpoint remains compatible while still exercising:

  • context parallel size 2
  • packed SQuAD SFT data
  • pretrain checkpoint creation
  • pretrained checkpoint load into SFT

It also runs a GC / CUDA cache cleanup barrier between the two phases.
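As a rough sketch of what such a cleanup step might look like (the helper name `cleanup_between_phases` is hypothetical, and the actual test code may differ), the idea is to collect unreachable Python objects, return cached CUDA blocks to the driver, and then synchronize ranks before the SFT phase allocates its buffers:

```python
import gc

try:
    import torch
    import torch.distributed as dist
except ImportError:  # keeps the sketch importable on hosts without PyTorch
    torch = None
    dist = None


def cleanup_between_phases() -> int:
    """Hypothetical helper: free memory after pretrain, before SFT.

    Returns the number of unreachable objects found by the garbage collector.
    """
    # Drop Python objects (and the tensors they own) that are no longer referenced.
    collected = gc.collect()

    # Release cached, unused blocks held by the CUDA caching allocator.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Wait for every rank so no process starts SFT while others still hold memory.
    if dist is not None and dist.is_available() and dist.is_initialized():
        dist.barrier()

    return collected
```

A helper like this only helps if the pretrain phase has already dropped its references to the model, optimizer, and DDP buffers; memory that is still referenced is not reclaimed by `gc.collect()` or `torch.cuda.empty_cache()`.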

Verification

  • python3 -m py_compile tests/functional_tests/test_groups/training/test_seqpacking_cp_example.py
  • git diff --check
  • uvx ruff check tests/functional_tests/test_groups/training/test_seqpacking_cp_example.py
  • uvx ruff format --check tests/functional_tests/test_groups/training/test_seqpacking_cp_example.py

Note: uv run ruff ... is blocked on this local host by the pinned nvidia-resiliency-ext wheel platform tag, so I used isolated uvx ruff for the file-scoped lint checks. GPU validation should come from CI.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 added the following labels on Jun 16, 2026: area:training (Training loop, callbacks, and runtime integration), bug (Something isn't working), needs-review (PR is ready for code review and waiting on a reviewer)
@yaoyu-33

Contributor Author

/ok to test b17315a

@claude

claude Bot commented Jun 16, 2026

Contributor

Review

LGTM. Clean fix that correctly addresses the OOM issue by shrinking the test model shape while preserving the coverage paths (CP + packing + checkpoint load).

Good details:

  • _set_existing_attr follows the project convention of guarding against phantom setattr on config dataclasses
  • GC + cache cleanup + barrier between pretrain and finetune phases is the right way to reclaim GPU memory in a single-process two-phase test
  • Model dimensions are internally consistent (kv_channels = hidden_size / num_attention_heads = 64)
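For illustration, a guard in the spirit of `_set_existing_attr` might look like the sketch below. Only the helper name and the `kv_channels = 64` relationship come from this PR; the `TinyModelConfig` class, the skip-on-missing behavior, and the specific `hidden_size` and `num_attention_heads` values are assumptions chosen to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class TinyModelConfig:
    """Hypothetical stand-in for a model provider config (placeholder defaults)."""

    hidden_size: int = 2048
    num_attention_heads: int = 32
    kv_channels: int = 64


def _set_existing_attr(obj, name, value) -> bool:
    """Assumed behavior: set `name` only if `obj` already defines it.

    Returns True when the attribute was set, False when it was skipped.
    """
    if not hasattr(obj, name):
        # Plain setattr would silently create a new, unused attribute here.
        return False
    setattr(obj, name, value)
    return True


cfg = TinyModelConfig()

# Illustrative shrunken shape, not the PR's actual numbers.
_set_existing_attr(cfg, "hidden_size", 512)
_set_existing_attr(cfg, "num_attention_heads", 8)
_set_existing_attr(cfg, "kv_channels", cfg.hidden_size // cfg.num_attention_heads)

# A misspelled field name is skipped instead of becoming a phantom attribute.
created = _set_existing_attr(cfg, "hiden_size", 256)
```

Applying the same overrides to both the pretrain and SFT configs is what would keep the checkpoint shapes compatible between the two phases.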

Suggested test cases

No perf tests impacted.

@yaoyu-33 merged commit 839da68 into main on Jun 16, 2026
105 checks passed
@yaoyu-33 deleted the yuya/fix-seqpacking-cp-oom branch on June 16, 2026 04:24
